Abstract


In this paper, we propose a variety of Long Short-Term Memory (LSTM)
based models for sequence tagging. These models include LSTM networks,
bidirectional LSTM (BI-LSTM) networks, LSTM with a Conditional Random
Field (CRF) layer (LSTM-CRF) and bidirectional LSTM with a CRF layer
(BI-LSTM-CRF). Our work is the first to apply a bidirectional LSTM CRF
(denoted as BI-LSTM-CRF) model to NLP benchmark sequence tagging data
sets. We show that the BI-LSTM-CRF model can efficiently use both past
and future input features thanks to a bidirectional LSTM component. It
can also use sentence level tag information thanks to a CRF layer. The
BI-LSTM-CRF model can produce state of the art (or close to) accuracy
on POS, chunking and NER data sets. In addition, it is robust and has
less dependence on word embedding as compared to previous observations.

1 Introduction 


Sequence tagging, including part of speech tagging (POS), chunking, and
named entity recognition (NER), has been a classic NLP task. It has
drawn research attention for a few decades. The output of taggers can
be used for downstream applications. For example, a named entity
recognizer trained on user search queries can be utilized to identify
which spans of text are products, thus triggering certain product ads.
Another example is that such tag information can be used by a search
engine to find relevant webpages.

Most existing sequence tagging models are linear statistical models
which include Hidden Markov Models (HMM), Maximum entropy Markov models
(MEMMs) (McCallum et al., 2000), and Conditional Random Fields (CRF)
(Lafferty et al., 2001). Convolutional network based models (Collobert
et al., 2011) have been recently proposed to tackle the sequence
tagging problem. We denote such a model as Conv-CRF as it consists of a
convolutional network and a CRF layer on the output (the term of
sentence level log-likelihood (SSL) was used in the original paper).
The Conv-CRF model has generated promising results on sequence tagging
tasks. In the speech language understanding community, recurrent neural
network (Mesnil et al., 2013; Yao et al., 2014) and convolutional nets
(Xu and Sarikaya, 2013) based models have been recently proposed. Other
relevant work includes (Graves et al., 2005; Graves et al., 2013) which
proposed a bidirectional recurrent neural network for speech
recognition.

In this paper, we propose a variety of neural network based models for
the sequence tagging task. These models include LSTM networks,
bidirectional LSTM networks (BI-LSTM), LSTM networks with a CRF layer
(LSTM-CRF), and bidirectional LSTM networks with a CRF layer
(BI-LSTM-CRF). Our contributions can be summarized as follows. 1) We
systematically compare the performance of aforementioned models on NLP
tagging data sets; 2) Our work is the first to apply a bidirectional
LSTM CRF (denoted as BI-LSTM-CRF) model to NLP benchmark sequence
tagging data sets. This model can use both past and future input
features thanks to a bidirectional LSTM component. In addition, this
model can use sentence level tag information thanks to a CRF layer. Our
model can produce state of the art (or close to) accuracy on POS,
chunking and NER data sets; 3) We show that the BI-LSTM-CRF model is
robust and it has less dependence on word embedding as compared to
previous observations (Collobert et al., 2011). It can produce accurate
tagging performance without resorting to word embedding.

The remainder of the paper is organized as follows. Section 2 describes
sequence tagging models used in this paper. Section 3 shows the
training procedure. Section 4 reports the experiments results. Section
5 discusses related research. Finally Section 6 draws conclusions.

2 Models 

In this section, we describe the models used in this 
paper: LSTM, BI-LSTM, CRF, LSTM-CRF and 
BI-LSTM-CRF. 


2.1 LSTM Networks 


Recurrent neural networks (RNN) have been employed to produce promising
results on a variety of tasks including language model (Mikolov et al.,
2010; Mikolov et al., 2011) and speech recognition (Graves et al.,
2005). A RNN maintains a memory based on history information, which
enables the model to predict the current output conditioned on long
distance features.

Figure 1 shows the RNN structure (Elman, 1990) which has an input layer
x, hidden layer h and output layer y. In named entity tagging context,
x represents input features and y represents tags. Figure 1 illustrates
a named entity recognition system in which each word is tagged with
other (O) or one of four entity types: Person (PER), Location (LOC),
Organization (ORG), and Miscellaneous (MISC). The sentence of
EU rejects German call to boycott British lamb . is tagged as
B-ORG O B-MISC O O O B-MISC O O, where B-, I- tags indicate beginning
and intermediate positions of entities.

An input layer represents features at time t. They could be
one-hot-encoding for word feature, dense vector features, or sparse
features. An input layer has the same dimensionality as feature size.
An output layer represents a probability distribution over labels at
time t. It has the same dimensionality as size of labels. Compared to
feedforward network, a RNN introduces the connection between the
previous hidden state and current hidden state (and thus the recurrent
layer weight parameters). This recurrent layer is designed to store
history information. The values in the hidden and output layers are
computed as follows:


$$h(t) = f(Ux(t) + Wh(t-1)), \qquad (1)$$

$$y(t) = g(Vh(t)), \qquad (2)$$

where U, W, and V are the connection weights to be computed in training
time, and f(z) and g(z) are sigmoid and softmax activation functions as
follows.

$$f(z) = \frac{1}{1 + e^{-z}}, \qquad (3)$$

$$g(z_m) = \frac{e^{z_m}}{\sum_k e^{z_k}}. \qquad (4)$$

Figure 1: A simple RNN model.
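As an illustration of Eqs. (1) to (4), the recurrence can be sketched in a few lines of plain Python. This is a minimal sketch rather than the implementation used in the paper: the helper names (`matvec`, `rnn_forward`) and the small weight values are assumptions chosen only for the example.

```python
import math

def sigmoid(z):
    # Eq. (3): logistic function applied to one scalar.
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    # Eq. (4), with the maximum subtracted for numerical stability.
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    total = sum(es)
    return [e / total for e in es]

def matvec(M, v):
    # Matrix-vector product for a list-of-rows matrix.
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def rnn_forward(xs, U, W, V):
    """Run Eqs. (1) and (2) over a sequence of feature vectors xs."""
    h = [0.0] * len(W)            # hidden state starts at zero
    ys = []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(U, x), matvec(W, h))]
        h = [sigmoid(p) for p in pre]      # Eq. (1)
        ys.append(softmax(matvec(V, h)))   # Eq. (2)
    return ys

# Hypothetical toy sizes: 3 input features, 2 hidden units, 2 tags.
U = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
W = [[0.2, 0.0], [-0.3, 0.1]]
V = [[0.5, -0.5], [-0.5, 0.5]]
xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one-hot inputs
ys = rnn_forward(xs, U, W, V)
```

Each element of `ys` is a probability distribution over tags for one time step, so its entries sum to one.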


In this paper, we apply Long Short-Term Memory (Hochreiter and
Schmidhuber, 1997; Graves et al., 2005) to sequence tagging. Long
Short-Term Memory networks are the same as RNNs, except that the hidden
layer updates are replaced by purpose-built memory cells. As a result,
they may be better at finding and exploiting long range dependencies in
the data. Fig. 2 illustrates a single LSTM memory cell (Graves et al.,
2005).

Figure 2: A Long Short-Term Memory Cell.

The LSTM memory cell is implemented as the following:

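$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \\
c_t &= f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \\
h_t &= o_t \tanh(c_t)
\end{aligned}
$$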




















































where $\sigma$ is the logistic sigmoid function, and i, f, o and c are
the input gate, forget gate, output gate and cell vectors, all of which
are the same size as the hidden vector h. The weight matrix subscripts
have the meaning as the name suggests. For example, $W_{hi}$ is the
hidden-input gate matrix, $W_{xo}$ is the input-output gate matrix etc.
The weight matrices from the cell to gate vectors (e.g. $W_{ci}$) are
diagonal, so element m in each gate vector only receives input from
element m of the cell vector.
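The memory cell update can be sketched as a single step function. This is a hedged illustration under stated assumptions, not the authors' code: the parameter dictionary layout (`"Wxi"`, `"Whi"`, `"bi"`, and so on) is hypothetical, and the diagonal cell-to-gate matrices are stored as plain vectors (`"wci"`, `"wcf"`, `"wco"`) so that element m of a gate only sees element m of the cell.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM memory cell step; p is a dict of weights (assumed layout)."""
    def pre(g):
        # W_xg x_t + W_hg h_{t-1} + b_g for gate/cell name g.
        wx = matvec(p["Wx" + g], x)
        wh = matvec(p["Wh" + g], h_prev)
        return [a + b + bias for a, b, bias in zip(wx, wh, p["b" + g])]

    n = len(h_prev)
    pi, pf, pc, po = pre("i"), pre("f"), pre("c"), pre("o")
    # Diagonal peephole weights act elementwise on the cell vector.
    i = [sigmoid(pi[m] + p["wci"][m] * c_prev[m]) for m in range(n)]
    f = [sigmoid(pf[m] + p["wcf"][m] * c_prev[m]) for m in range(n)]
    c = [f[m] * c_prev[m] + i[m] * math.tanh(pc[m]) for m in range(n)]
    o = [sigmoid(po[m] + p["wco"][m] * c[m]) for m in range(n)]  # uses new c_t
    h = [o[m] * math.tanh(c[m]) for m in range(n)]
    return h, c

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

# Toy check with all-zero weights: every gate equals sigmoid(0) = 0.5.
n_in, n_hid = 3, 2
params = {}
for g in "ifco":
    params["Wx" + g] = zeros(n_hid, n_in)
    params["Wh" + g] = zeros(n_hid, n_hid)
    params["b" + g] = [0.0] * n_hid
for g in "ifo":
    params["wc" + g] = [0.0] * n_hid

h, c = lstm_step([1.0, 0.0, 0.0], [0.0, 0.0], [1.0, -2.0], params)
```

With zero weights the forget gate halves the previous cell value and the candidate contributes nothing, which makes the step easy to verify by hand.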

Fig. 3 shows a LSTM sequence tagging model which employs aforementioned
LSTM memory cells (dashed boxes with rounded corners).

Figure 3: A LSTM network.


2.2 Bidirectional LSTM Networks 


In sequence tagging task, we have access to both past and future input
features for a given time. We can thus utilize a bidirectional LSTM
network (Figure 4) as proposed in (Graves et al., 2013). In doing so,
we can efficiently make use of past features (via forward states) and
future features (via backward states) for a specific time frame. We
train bidirectional LSTM networks using back-propagation through time
(BPTT) (Boden, 2002). The forward and backward passes over the unfolded
network over time are carried out in a similar way to regular network
forward and backward passes, except that we need to unfold the hidden
states for all time steps. We also need a special treatment at the
beginning and the end of the data points. In our implementation, we do
forward and backward for whole sentences and we only need to reset the
hidden states to 0 at the beginning of each sentence. We have batch
implementation which enables multiple sentences to be processed at the
same time.
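The way forward and backward states are combined can be sketched independently of the cell type. The following is a minimal illustration, assuming a generic `step(x, h)` function in place of an LSTM cell; the running-sum step used in the demo is a deliberately trivial stand-in so that the result can be checked by hand.

```python
def run_direction(xs, step, n_hid):
    """Unfold one recurrent direction over a sentence.

    The hidden state is reset to 0 at the beginning of the sentence.
    """
    h = [0.0] * n_hid
    states = []
    for x in xs:
        h = step(x, h)
        states.append(h)
    return states

def bidirectional(xs, fwd_step, bwd_step, n_hid):
    """Concatenate forward (past) and backward (future) states per position."""
    fwd = run_direction(xs, fwd_step, n_hid)
    # The backward direction reads the sentence right to left; its states
    # are reversed again so that position t lines up with fwd[t].
    bwd = run_direction(xs[::-1], bwd_step, n_hid)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]

def running_sum(x, h):
    # Toy step (an assumption for the demo): accumulate the inputs seen so far.
    return [h[0] + x[0]]

xs = [[1.0], [2.0], [3.0]]
states = bidirectional(xs, running_sum, running_sum, n_hid=1)
```

At each position the first component summarizes the inputs up to and including that position, and the second summarizes the inputs from that position to the end of the sentence.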


2.3 CRF networks 

Figure 4: A bidirectional LSTM network.

There are two different ways to make use of neighbor tag information in
predicting current tags. The first is to predict a distribution of tags
for each time step and then use beam-like decoding to find optimal tag
sequences. The work of maximum entropy classifier (Ratnaparkhi, 1996)
and Maximum entropy Markov models (MEMMs) (McCallum et al., 2000) fall
in this category. The second one is to focus on sentence level instead
of individual positions, thus leading to Conditional Random Fields
(CRF) models (Lafferty et al., 2001) (Fig. 5). Note that the inputs and
outputs are directly connected, as opposed to LSTM and bidirectional
LSTM networks where memory cells/recurrent components are employed.

It has been shown that CRFs can produce higher tagging accuracy in
general. It is interesting that the relation between these two ways of
using tag information bears resemblance to two ways of using input
features (see aforementioned LSTM and BI-LSTM networks), and the
results in this paper confirm the superiority of BI-LSTM compared to
LSTM.

Figure 5: A CRF network.


2.4 LSTM-CRF networks 

We combine a LSTM network and a CRF network to form a LSTM-CRF model,
which is shown in Fig. 6. This network can efficiently use past input
features via a LSTM layer and sentence level tag information via a CRF
layer. A CRF layer is represented by lines which connect consecutive
output layers. A CRF layer has a state transition matrix as parameters.
With such a layer, we can efficiently use past and future tags to
predict the current tag, which is similar to the use of past and future
input features via a bidirectional LSTM network. We consider the matrix
of scores $f_\theta([x]_1^T)$ output by the network. We drop the input
$[x]_1^T$ for notation simplification. The element $[f_\theta]_{i,t}$
of the matrix is the score output by the network with parameters
$\theta$, for the sentence $[x]_1^T$ and for the $i$-th tag, at the
$t$-th word. We introduce a transition score $[A]_{i,j}$ to model the
transition from the $i$-th state to the $j$-th for a pair of
consecutive time steps. Note that this transition matrix is position
independent. We now denote the new parameters for our network as
$\tilde{\theta} = \theta \cup \{[A]_{i,j} \ \forall i, j\}$. The score
of a sentence $[x]_1^T$ along with a path of tags $[i]_1^T$ is then
given by the sum of transition scores and network scores:

$$s([x]_1^T, [i]_1^T, \tilde{\theta}) = \sum_{t=1}^{T} \left( [A]_{[i]_{t-1},[i]_t} + [f_\theta]_{[i]_t,t} \right). \qquad (5)$$

Dynamic programming (Rabiner, 1989) can be used efficiently to compute
$[A]_{i,j}$ and optimal tag sequences for inference. See (Lafferty et
al., 2001) for details.
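The sentence-level score and the dynamic-programming search for the best tag path can be sketched as follows. This is an illustrative sketch under assumptions, not the paper's implementation: scores are stored as `emissions[t][i]` (the network score of tag i at word t), and the transition into the first tag is modeled by a hypothetical `start` vector.

```python
def path_score(emissions, transitions, start, tags):
    """Sum of transition scores and network scores along one tag path."""
    score = start[tags[0]] + emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score

def viterbi(emissions, transitions, start):
    """Return the highest-scoring tag path and its score."""
    n_tags = len(start)
    best = [start[j] + emissions[0][j] for j in range(n_tags)]
    backpointers = []
    for t in range(1, len(emissions)):
        ptr, new_best = [], []
        for j in range(n_tags):
            # Best previous tag i for arriving at tag j at position t.
            i_best = max(range(n_tags), key=lambda i: best[i] + transitions[i][j])
            ptr.append(i_best)
            new_best.append(best[i_best] + transitions[i_best][j] + emissions[t][j])
        backpointers.append(ptr)
        best = new_best
    last = max(range(n_tags), key=lambda j: best[j])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, best[last]

# Toy example with 3 words and 2 tags (all numbers are made up).
emissions = [[2.0, 0.0], [0.5, 1.0], [0.0, 1.5]]
transitions = [[0.5, -1.0], [-1.0, 0.5]]   # transitions[i][j]: from tag i to tag j
start = [0.0, 0.0]
path, score = viterbi(emissions, transitions, start)
```

The search costs time proportional to the sentence length times the square of the number of tags, instead of enumerating every possible path.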
Figure 6: A LSTM-CRF model.


2.5 BI-LSTM-CRF networks 

Similar to a LSTM-CRF network, we combine a bidirectional LSTM network
and a CRF network to form a BI-LSTM-CRF network (Fig. 7). In addition
to the past input features and sentence level tag information used in a
LSTM-CRF model, a BI-LSTM-CRF model can use the future input features.
The extra features can boost tagging accuracy as we will show in
experiments.

Figure 7: A BI-LSTM-CRF model.

3 Training procedure

All models used in this paper share a generic SGD forward and backward
training procedure. We choose the most complicated model, BI-LSTM-CRF,
to illustrate the training algorithm as shown in Algorithm 1. In each
epoch, we divide the whole training data into batches and process one
batch at a time. Each batch contains a list of sentences which is
determined by the parameter of batch size. In our experiments, we use
batch size of 100 which means to include sentences whose total length
is no greater than 100. For each batch, we first run the bidirectional
LSTM-CRF model forward pass, which includes the forward pass for both
forward state and backward state of LSTM. As a result, we get the
output score $f_\theta([x]_1^T)$ for all tags at all positions. We then
run the CRF layer forward and backward pass to compute gradients for
network output and state transition edges. After that, we can back
propagate the errors from the output to the input, which includes the
backward pass for both forward and backward states of LSTM. Finally we
update the network parameters, which include the state transition
matrix $[A]_{i,j} \ \forall i, j$ and the original bidirectional LSTM
parameters $\theta$.

Algorithm 1 Bidirectional LSTM CRF model training procedure

for each epoch do
    for each batch do
        1) bidirectional LSTM-CRF model forward pass:
            forward pass for forward state LSTM
            forward pass for backward state LSTM
        2) CRF layer forward and backward pass
        3) bidirectional LSTM-CRF model backward pass:
            backward pass for forward state LSTM
            backward pass for backward state LSTM
        4) update parameters
    end for
end for

4 Experiments

4.1 Data


We test LSTM, BI-LSTM, CRF, LSTM-CRF, and BI-LSTM-CRF models on three
NLP tagging tasks: Penn TreeBank (PTB) POS tagging, CoNLL 2000
chunking, and CoNLL 2003 named entity tagging. Table 1 shows the size
of sentences, tokens, and labels for training, validation and test sets
respectively.

POS assigns each word with a unique tag that indicates its syntactic
role. In chunking, each word is tagged with its phrase type. For
example, tag B-NP indicates a word starting a noun phrase. In NER task,
each word is tagged with other or one of four entity types: Person,
Location, Organization, or Miscellaneous. We use the BIO2 annotation
standard for chunking and NER tasks.


4.2 Features 


We extract the same types of features for three data sets. The features
can be grouped as spelling features and context features. As a result,
we have 401K, 76K, and 341K features extracted for POS, chunking and
NER data sets respectively. These features are similar to the features
extracted from Stanford NER tool (Finkel et al., 2005; Wang and
Manning, 2013). Note that we did not use extra data for POS and
chunking tasks, with the exception of using Senna embedding (see
Section 4.2.3). For NER task, we report performance with spelling and
context features, and also incrementally with Senna embedding and
Gazetteer features¹.


4.2.1 Spelling features 

We extract the following features for a given word in addition to the
lower case word features.

• whether it starts with a capital letter

• whether it has all capital letters

• whether it has all lower case letters

• whether it has non initial capital letters

• whether it mixes letters and digits

• whether it has punctuation

• letter prefixes and suffixes (with window size of 2 to 5)

• whether it has an apostrophe end (’s)

• letters only, for example, I.B.M. to IBM

• non-letters only, for example, A.T.&T. to ..&

• word pattern feature, with capital letters, lower case letters, and
digits mapped to ‘A’, ‘a’ and ‘0’ respectively, for example, D56y-3
to A00a-0

• word pattern summarization feature, similar to word pattern feature
but with consecutive identical characters removed. For example,
D56y-3 to A0a-0
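Several of the spelling features listed above are simple string transformations. The sketch below shows one possible way to compute a few of them; the function names are hypothetical and the exact feature definitions used in the paper may differ in detail.

```python
def word_pattern(word):
    """Map capitals to 'A', lower case letters to 'a', digits to '0'."""
    out = []
    for ch in word:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("0")
        else:
            out.append(ch)       # keep punctuation and other symbols as-is
    return "".join(out)

def pattern_summary(word):
    """Word pattern with runs of identical characters collapsed to one."""
    out = []
    for ch in word_pattern(word):
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)

def letters_only(word):
    return "".join(ch for ch in word if ch.isalpha())

def affixes(word, lo=2, hi=5):
    """Letter prefixes and suffixes of length lo..hi (when the word is long enough)."""
    lengths = [n for n in range(lo, hi + 1) if n <= len(word)]
    prefixes = [word[:n] for n in lengths]
    suffixes = [word[-n:] for n in lengths]
    return prefixes, suffixes

def spelling_flags(word):
    """A few of the boolean features as a dict."""
    return {
        "init_cap": word[:1].isupper(),
        "all_caps": word.isalpha() and word.isupper(),
        "all_lower": word.isalpha() and word.islower(),
        "letters_and_digits": any(c.isalpha() for c in word)
                              and any(c.isdigit() for c in word),
        "apostrophe_end": word.endswith("'s") or word.endswith("’s"),
    }

pattern = word_pattern("D56y-3")
summary = pattern_summary("D56y-3")
```

In a tagger each returned string or flag would typically be turned into a sparse indicator feature for the word.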


4.2.2 Context features 

For word features in three data sets, we use unigram features and
bi-gram features. For POS features in CoNLL2000 data set and POS &
CHUNK features in CoNLL2003 data set, we use unigram, bi-gram and
tri-gram features.


4.2.3 Word embedding 

It has been shown in (Collobert et al., 2011) that word embedding plays
a vital role to improve sequence tagging performance. We downloaded²
the embedding which has 130K vocabulary size and each word corresponds
to a 50-dimensional embedding vector. To use this embedding, we simply
replace the one hot encoding word representation with its corresponding
50-dimensional vector.


4.2.4 Features connection tricks 


We can treat spelling and context features the same as word features.
That is, the inputs of networks include word, spelling and context
features. However, we find that direct connections from spelling and
context features to outputs accelerate training and they result in very
similar tagging accuracy. Fig. 8 illustrates this network in which
features have direct connections to outputs of networks. We will report
all tagging accuracy using this connection. We note that this usage of
features has the same flavor of Maximum Entropy features as used in
(Mikolov et al., 2011). The difference is that features collision may
occur in (Mikolov et al., 2011) as feature hashing technique has been
adopted. Since sequence tagging data sets have far fewer output labels
than a language model (usually hundreds of thousands), we can afford to
have full connections between features and outputs to avoid potential
feature collisions.

Figure 8: A BI-LSTM-CRF model with MaxEnt features.

¹ Downloaded from http://ronan.collobert.com/senna/

² http://ronan.collobert.com/senna/































Table 1: Size of sentences, tokens, and labels for training, validation and test sets.

                           POS       CoNLL2000   CoNLL2003
training     sentence #    39831     8936        14987
             token #       950011    211727      204567
validation   sentence #    1699      N/A         3466
             token #       40068     N/A         51578
test         sentence #    2415      2012        3684
             token #       56671     47377       46666
             label #       45        22          9


4.3 Results 


We train LSTM, BI-LSTM, CRF, LSTM-CRF and BI-LSTM-CRF models for each
data set. We have two ways to initialize word embedding: Random and
Senna. We randomly initialize the word embedding vectors in the first
category, and use Senna word embedding in the second category. For each
category, we use identical feature sets, thus different results are
solely due to different networks. We train models using training data
and monitor performance on validation data. As chunking data do not
have a validation data set, we use part of training data for validation
purpose.

We use a learning rate of 0.1 to train models. We set hidden layer size
to 300 and found that model performance is not sensitive to hidden
layer sizes. The training for three tasks requires less than 10 epochs
to converge and it in general takes less than a few hours. We report
models’ performance on test datasets in Table 2, which also lists the
best results in (Collobert et al., 2011), denoted as Conv-CRF. The POS
task is evaluated by computing per-word accuracy, while the chunk and
NER tasks are evaluated by computing F1 scores over chunks.
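To make the chunk-level metric concrete, the sketch below computes F1 over chunks for one sentence tagged in the BIO2 scheme. It is a simplified illustration, not the official evaluation script: the function names are hypothetical, and an I- tag that does not continue a chunk of the same type is leniently treated as starting a new chunk, which is an assumption.

```python
def extract_chunks(tags):
    """Return the set of (type, start, end) chunks; end is exclusive."""
    chunks = set()
    start, kind = None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes the last chunk
        continues = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not continues:
            chunks.add((kind, start, i))
            start, kind = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, kind = i, tag[2:]
    return chunks

def chunk_f1(gold_tags, pred_tags):
    """F1 where a predicted chunk counts only if type and span both match."""
    gold, pred = extract_chunks(gold_tags), extract_chunks(pred_tags)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

# Gold tags for "EU rejects German call to boycott British lamb ."
gold = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
# A hypothetical prediction that misses the second MISC entity.
pred = ["B-ORG", "O", "B-MISC", "O", "O", "O", "O", "O", "O"]
f1 = chunk_f1(gold, pred)
```

Here two of the three gold chunks are found and both predictions are correct, so precision is 1, recall is 2/3, and F1 is 0.8. Per-word accuracy on the same prediction would be 8/9, which shows why the two metrics are not interchangeable.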


4.3.1 Comparison with Conv-CRF networks

We have three baselines: LSTM, BI-LSTM and CRF. LSTM is the weakest
baseline for all three data sets. The BI-LSTM performs close to CRF on
POS and chunking datasets, but is worse than CRF on NER data set. The
CRF forms strong baselines in our experiments. For random category, CRF
models outperform Conv-CRF models for all three data sets. For Senna
category, CRFs outperform Conv-CRF for POS task, while underperform for
chunking and NER task. LSTM-CRF models outperform CRF models for all
data sets in both random and Senna categories. This shows the
effectiveness of the forward state LSTM component in modeling sequence
data. The BI-LSTM-CRF models further improve LSTM-CRF models and they
lead to the best tagging performance for all cases except for POS data
at random category, in which LSTM-CRF model is the winner. The numbers
in parentheses for CoNLL 2003 under Senna categories are generated with
Gazetteer features.

It is interesting that our best model BI-LSTM-CRF has less dependence
on Senna word embedding compared to Conv-CRF model. For example, the
tagging differences for the BI-LSTM-CRF model between random and Senna
categories are 0.12%, 0.33%, and 4.57% for POS, chunking and NER data
sets respectively. In contrast, the Conv-CRF model heavily relies on
Senna embedding to get good tagging accuracy. It has tagging
differences of 0.92%, 3.99% and 7.20% between random and Senna category
for POS, chunking and NER data sets respectively.

4.3.2 Model robustness 

To estimate the robustness of models with respect to engineered
features (spelling and context features), we train LSTM, BI-LSTM, CRF,
LSTM-CRF, and BI-LSTM-CRF models with word features only (spelling and
context features removed). Table 3 shows tagging performance of
proposed models for POS, chunking, and NER data sets using Senna word
embedding. The numbers in parentheses indicate the performance
degradation compared to the same models but using spelling and context
features. CRF models’ performance is significantly degraded with the
removal of spelling and context features. This reveals the fact that
CRF models heavily rely on engineered features to obtain good
performance. On the other hand, LSTM based models, especially BI-LSTM
and BI-LSTM-CRF models, are more robust and they are less affected by
the removal of engineered features. For all three tasks, BI-LSTM-CRF
models result in the highest tagging accuracy. For example, it achieves
the F1 score of 94.40 for CoNLL2000 chunking, with slight degradation
(0.06) compared to the same model but using spelling and context
features.

Table 2: Comparison of tagging performance on POS, chunking and NER tasks for various models.

                                               POS     CoNLL2000   CoNLL2003
Random   Conv-CRF (Collobert et al., 2011)     96.37   90.33       81.47
         LSTM                                  97.10   92.88       79.82
         BI-LSTM                               97.30   93.64       81.11
         CRF                                   97.30   93.69       83.02
         LSTM-CRF                              97.45   93.80       84.10
         BI-LSTM-CRF                           97.43   94.13       84.26
Senna    Conv-CRF (Collobert et al., 2011)     97.29   94.32       88.67 (89.59)
         LSTM                                  97.29   92.99       83.74
         BI-LSTM                               97.40   93.92       85.17
         CRF                                   97.45   93.83       86.13
         LSTM-CRF                              97.54   94.27       88.36
         BI-LSTM-CRF                           97.55   94.46       88.83 (90.10)

4.3.3 Comparison with existing systems 

For POS data set, we achieved state of the art tagging accuracy with or
without the use of extra data resource. POS data set has been
extensively tested and the past improvement can be seen in Table 4. Our
test accuracy is 97.55%, which is significantly better than others at
the 95% confidence level. In addition, our BI-LSTM-CRF model already
reaches a good accuracy without the use of the Senna embedding.

All chunking systems performance is shown in Table 5. Kudo et al. won
the CoNLL 2000 challenge with a F1 score of 93.48%. Their approach was
a SVM based classifier. They later improved the results up to 93.91%.
Recent work includes the CRF based models (Sha and Pereira, 2003;
McDonald et al., 2005; Sun et al., 2008). More recent is (Shen and
Sarkar, 2005) which obtained 95.23% accuracy with a voting classifier
scheme, where each classifier is trained on different tag
representations (IOB, IOE, etc.). Our model outperforms all reported
systems except (Shen and Sarkar, 2005).


The performance of all systems for NER is shown in Table 6. (Florian et
al., 2003) presented the best system at the NER CoNLL 2003 challenge,
with 88.76% F1 score. They used a combination of various
machine-learning classifiers. The second best performer of CoNLL 2003
(Chieu, 2003) was 88.31% F1, also with the help of an external
gazetteer. Later, (Ando and Zhang, 2005) reached 89.31% F1 with a
semi-supervised approach. The best F1 score of 90.90% was reported in
(Passos et al., 2014) which employed a new form of learning word
embeddings that can leverage information from relevant lexicons to
improve the representations. Our model can achieve the best F1 score of
90.10 with both Senna embedding and gazetteer features. It has a lower
F1 score than (Passos et al., 2014), which may be due to the fact that
different word embeddings were employed. With the same Senna embedding,
BI-LSTM-CRF slightly outperforms Conv-CRF (90.10% vs. 89.59%). However,
BI-LSTM-CRF significantly outperforms Conv-CRF (84.26% vs. 81.47%) if
random embedding is used.

5 Discussions

Our work is close to the work of (Collobert et al., 2011) as both of
them utilized deep neural networks for sequence tagging. While their
work used convolutional neural networks, ours used bidirectional LSTM
networks.

Our work is also close to the work of (Hammerton, 2003; Yao et al.,
2014) as all of them employed LSTM network for tagging. The performance
in (Hammerton, 2003) was not impressive. The work in (Yao et al., 2014)
did not make use of bidirectional LSTM and CRF layers and thus the
tagging accuracy may have suffered.

Finally, our work is related to the work of (Wang and Manning, 2013)
which concluded that non-linear architecture offers no benefits in a
high-dimensional discrete feature space. We showed that with the
bi-directional LSTM CRF model, we consistently obtained better tagging
accuracy than a single CRF model with identical feature sets.

Table 3: Tagging performance on POS, chunking and NER tasks with only word features.

                       POS             CoNLL2000       CoNLL2003
Senna   LSTM           94.63 (-2.66)   90.11 (-2.88)   75.31 (-8.43)
        BI-LSTM        96.04 (-1.36)   93.80 (-0.12)   83.52 (-1.65)
        CRF            94.23 (-3.22)   85.34 (-8.49)   77.41 (-8.72)
        LSTM-CRF       95.62 (-1.92)   93.13 (-1.14)   81.45 (-6.91)
        BI-LSTM-CRF    96.11 (-1.44)   94.40 (-0.06)   84.74 (-4.09)

Table 4: Comparison of tagging accuracy of different models for POS.

System                                                                accuracy   extra data
Maximum entropy cyclic dependency network (Toutanova et al., 2003)    97.24      No
SVM-based tagger (Gimenez and Marquez, 2004)                          97.16      No
Bidirectional perceptron learning (Shen et al., 2007)                 97.33      No
Semi-supervised condensed nearest neighbor (Soegaard, 2011)           97.50      Yes
CRFs with structure regularization (Sun, 2014)                        97.36      No
Conv network tagger (Collobert et al., 2011)                          96.37      No
Conv network tagger (senna) (Collobert et al., 2011)                  97.29      Yes
BI-LSTM-CRF (ours)                                                    97.43      No
BI-LSTM-CRF (Senna) (ours)                                            97.55      Yes

Table 5: Comparison of F1 scores of different models for chunking.

System                                                      F1
SVM classifier (Kudo and Matsumoto, 2000)                   93.48
SVM classifier (Kudo and Matsumoto, 2001)                   93.91
Second order CRF (Sha and Pereira, 2003)                    94.30
Specialized HMM + voting scheme (Shen and Sarkar, 2005)     95.23
Second order CRF (McDonald et al., 2005)                    94.29
Second order CRF (Sun et al., 2008)                         94.34
Conv-CRF (Collobert et al., 2011)                           90.33
Conv network tagger (senna) (Collobert et al., 2011)        94.32
BI-LSTM-CRF (ours)                                          94.13
BI-LSTM-CRF (Senna) (ours)                                  94.46

Table 6: Comparison of F1 scores of different models for NER.

System                                                         F1
Combination of HMM, Maxent etc. (Florian et al., 2003)         88.76
MaxEnt classifier (Chieu, 2003)                                88.31
Semi-supervised model combination (Ando and Zhang, 2005)       89.31
Conv-CRF (Collobert et al., 2011)                              81.47
Conv-CRF (Senna + Gazetteer) (Collobert et al., 2011)          89.59
CRF with Lexicon Infused Embeddings (Passos et al., 2014)      90.90
BI-LSTM-CRF (ours)                                             84.26
BI-LSTM-CRF (Senna + Gazetteer) (ours)                         90.10

6 Conclusions

In this paper, we systematically compared the performance of LSTM
networks based models for sequence tagging. We presented the first work
of applying a BI-LSTM-CRF model to NLP benchmark sequence tagging data.
Our model can produce state of the art (or close to) accuracy on POS,
chunking and NER data sets. In addition, our model is robust and it has
less dependence on word embedding as compared to the observation in
(Collobert et al., 2011). It can achieve accurate tagging without
resorting to word embedding.